feat(academy): add advanced crawling section with sitemaps and search #1217
base: master
Conversation
Will process the lint issues soon.
@TC-MO If we change the URL of an article, do I need to contact the web team to set a hard redirect?
I think we do redirects in the nginx.conf file, not sure if there is any other way.
TODO redirect
@@ -9,6 +9,16 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

# How to scrape from sitemaps {#scraping-with-sitemaps}

> Crawlee recently introduced a new feature that allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code:
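The code block itself is cut off in this excerpt; a rough sketch of what those few lines could look like, assuming Crawlee's `Sitemap` utility and a placeholder sitemap URL:

```js
// Sketch only - not the snippet from the PR. The sitemap URL is a placeholder.
import { Sitemap } from 'crawlee';

// Download and parse the sitemap; nested and gzipped sitemaps are handled for you.
const { urls } = await Sitemap.load('https://example.com/sitemap.xml');

console.log(`Collected ${urls.length} URLs from the sitemap.`);
```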
This feels like unnecessary dating? When is "recently"? Also I think this could work better as an admonition instead of a blockquote.
---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
Is this something that is custom created by Apify? I haven't seen this anywhere else
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
paths:
Isn't this supposed to be `slug:`?
Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans the URL variations automatically for you so that you don't have to check manually.
|
## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps |
If anchors do not differ from the headings, then these are unnecessary, from what I remember.
Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), which has rich traversing and parsing support for sitemaps. Crawlee can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:
|
```javascript |
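// The body of this code block is truncated in the excerpt above. A minimal sketch
// of what it might contain, assuming Crawlee's Sitemap utility and CheerioCrawler;
// example.com stands in for the target site.
import { CheerioCrawler, Sitemap } from 'crawlee';

// Gather every URL listed in the sitemap, including nested sitemap indexes.
const { urls } = await Sitemap.load('https://example.com/sitemap.xml');

// Feed the collected URLs straight into a crawler.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(urls);
```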
Could we switch it to ```js? Some time back we changed this for consistency across Academy & Platform docs. I'll add this info to the contributing guidelines.
- advanced-web-scraping/crawling/sitemaps-vs-search
---

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.
I'm not entirely sure if `home page` is correct, perhaps @TheoVasilis could weigh in? Home page or homepage?
The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites. We just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.

Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
Suggested change:

```diff
- Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 thousand products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
+ Unfortunately, **most modern websites restrict pagination** only to somewhere between 1 and 10 000 products. Solving this problem might seem relatively straightforward at first but there are multiple hurdles that we will explore in this lesson.
```
category: web scraping & automation
slug: /advanced-web-scraping
paths:
- advanced-web-scraping
---

# Advanced web scraping
If the title in frontmatter does not differ from the h1, the h1 is unnecessary; it will be automatically generated by Docusaurus.
---
In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

## [](#what-does-production-ready-mean) What does production-ready mean? |
If I remember correctly, headers should not use punctuation
Will review, but I think I will wait for @TC-MO's comments to be addressed first.
No description provided.